Domain Adaptation in Statistical Machine Translation Systems via User Feedback
Author: Marcus Dobrinkat
Abstract
Machine translation research has progressed in recent years thanks to statistical machine learning methods, sufficient computational power, open source tools and the increasing availability of bilingual parallel text resources. However, most of these systems stay in the hands of researchers and are not improved with public users in mind. The motivation behind this thesis is the vision of freely available machine translation systems. They may be particularly important for languages and domains where there is not enough commercial interest in providing such services otherwise. The main focus of this work was to collect reference translations for Finnish news sentences, and to use this data to improve a baseline translation system on this news domain. A web application was created for rating and correcting translations, and volunteers were invited to participate in the effort. Then, three different approaches to domain adaptation were implemented and evaluated using the news domain data. In particular, language model interpolation, translation model interpolation and post-editing were studied. Thanks to the volunteers, a 1,000-sentence bilingual Finnish-English news corpus was assembled. The corpus is a good asset for further research in domain adaptation. The adaptation results show that a combination of language model and translation model interpolation effectively adapts the baseline system to the news domain. Using available domain adaptation methods, translation systems can be built with simple means and adjusted to the users' needs through community feedback.
Acknowledgments
I have written this thesis in the computational cognitive systems group, which I wish to thank for the warm welcome and the good, inspiring working atmosphere. I am especially thankful to Jaakko Väyrynen, who has been a devoted and obliging instructor. In our discussions and through your extensive feedback, I have learned a lot, and you helped me to focus. Next, I would like to thank my supervisor, Doc. Timo Honkela, for his innovative ideas, feedback and help in finding a suitable topic. I am grateful to my colleagues Sami Virpioja, for helping me with Morfessor, and Janne Argillander, for proofreading and interesting discussions. I also wish to thank IBM Finland for the financial support and work time flexibility, which permitted me to write this thesis. My family has provided me with constant help and love. I am especially grateful to my wife Sonja, who managed to support me during this long year. Thanks to Ole and Synne, who spent countless hours taking care of their grandchildren, and …
Similar resources
Continuous Adaptation to User Feedback for Statistical Machine Translation
This paper gives a detailed experiment feedback of different approaches to adapt a statistical machine translation system towards a targeted translation project, using only small amounts of parallel in-domain data. The experiments were performed by professional translators under realistic conditions of work using a computer assisted translation tool. We analyze the influence of these adaptation...
Domain Adaptation for Statistical Machine Translation of Corporate and User-Generated Content
Domain Adaptation in Statistical Machine Translation of User-Forum Data using Component-Level Mixture Modelling
This paper reports experiments on adapting components of a Statistical Machine Translation (SMT) system for the task of translating online user-generated forum data from Symantec. Such data is monolingual, and differs from available bitext MT training resources in a number of important respects. For this reason, adaptation techniques are important to achieve optimal results. We investigate the ...
Domain Adaptation via Pseudo In-Domain Data Selection
We explore efficient domain adaptation for the task of statistical machine translation based on extracting sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences may be selected with simple cross-entropy based methods, of which we present three. As these sentences are not themselves identical to the in-domain data, we call them pseudo i...
A Novel Approach to Trace Time-Domain Trajectories of Power Systems in Multiple Time Scales Based Flatness
This paper works on the concept of flatness and its practical application for the design of an optimal transient controller in a synchronous machine. The feedback linearization scheme of interest requires the generation of a flat output from which the feedback control law can easily be designed. Thus the computation of the flat output for reduced order model of the synchronous machine with simp...